Unlabelled Data Processors #87

monica-sekoyan · 2024-10-07T13:04:44Z

Added a processor for the Babel dataset (language independent)
Added a processor for the unlabelled subset of the VoxPopuli dataset (language independent)
Added a generic config for processing YODAS (or similar datasets)
Added processors for audio segmentation, untarring audio files, and emoji removal
Added tests for the new processors
Corrected the Armenian audiobooks test data in the S3 bucket (the test was failing because of an incorrect reference)
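
For reviewers who want a feel for the emoji-removal step listed above, here is a rough sketch of the idea, not the code in this PR. The helper name `remove_emojis`, the `text_key` parameter, and the chosen Unicode ranges are all assumptions made for illustration; the actual processor may use a different interface and a different character set.

```python
import re

# Hypothetical helper for illustration only; not the processor added in this PR.
# The ranges below cover a few common emoji blocks and are not exhaustive.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F300-\U0001F5FF"  # symbols and pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols and pictographs
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "]+"
)


def remove_emojis(entry, text_key="text"):
    """Return a copy of a manifest entry with emojis stripped from entry[text_key]."""
    cleaned = EMOJI_PATTERN.sub("", entry[text_key])
    # Collapse the extra whitespace left behind by the removed characters.
    cleaned = " ".join(cleaned.split())
    return {**entry, text_key: cleaned}
```

Called on a manifest-style dict such as `{"audio_filepath": "a.wav", "text": "..."}`, it returns a new dict and leaves the input untouched, which keeps the step easy to test in isolation.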

P.S. All tests pass locally.

Signed-off-by: monica-sekoyan <[email protected]>

monica-sekoyan added 8 commits October 5, 2024 19:22

add babel processors

8abc0aa

Signed-off-by: monica-sekoyan <[email protected]>

adding data modif processors -s

be88d1f

Signed-off-by: monica-sekoyan <[email protected]>

adding corrupted data remover processor

5cbd01e

Signed-off-by: monica-sekoyan <[email protected]>

updating transcribe_speech

33154da

Signed-off-by: monica-sekoyan <[email protected]>

adding processors to init

777f34a

Signed-off-by: monica-sekoyan <[email protected]>

add basic processor for yodas

e2a39af

Signed-off-by: monica-sekoyan <[email protected]>

add tests for new processors

395d430

Signed-off-by: monica-sekoyan <[email protected]>

add new processors to docs

82adfbd

Signed-off-by: monica-sekoyan <[email protected]>

monica-sekoyan requested review from erastorgueva-nv and karpnv October 7, 2024 13:29

monica-sekoyan added 5 commits October 9, 2024 16:13

modifying tests

068b951

Signed-off-by: monica-sekoyan <[email protected]>

add pydub

f06a94d

Signed-off-by: monica-sekoyan <[email protected]>

small fix

98d4ff0

Signed-off-by: monica-sekoyan <[email protected]>

update numpy version

0fb7d8f

Signed-off-by: monica-sekoyan <[email protected]>

restored the old version of transcribe_speech

f782eee

Signed-off-by: monica-sekoyan <[email protected]>
